skip to main content


Search for: All records

Creators/Authors contains: "Yuan, Xu"

Note: When clicking on a Digital Object Identifier (DOI) number, you will be taken to an external site maintained by the publisher. Some full text articles may not yet be available without a charge during the embargo (administrative interval).
What is a DOI Number?

Some links on this page may take you to non-federal websites. Their policies may differ from this site.

  1. Serverless computing has been favored by users and infrastructure providers from various industries, including online services and scientific computing. Users enjoy its auto-scaling and ease-of-management, and providers own more control to optimize their service. However, existing serverless platforms still require users to pre-define resource allocations for their functions, leading to frequent misconfiguration by inexperienced users in practice. Besides, functions' varying input data further escalate the gap between their dynamic resource demands and static allocations, leaving functions either over-provisioned or under-provisioned. This paper presents Libra, a safe and timely resource harvesting framework for multi-node serverless clusters. Libra makes precise harvesting decisions to accelerate function invocations with harvested resources and jointly improve resource utilization by profiling dynamic resource demands and availability proactively. Experiments on OpenWhisk clusters with real-world workloads show that Libra reduces response latency by 39% and achieves 3X resource utilization compared to state-of-the-art solutions. 
    more » « less
    Free, publicly-accessible full text available August 7, 2024
  2. Quantum computers in the current noisy intermediate-scale quantum (NISQ) era face two major limitations - size and error vulnerability. Although quantum error correction (QEC) methods exist, they are not applicable at the current size of computers, requiring thousands of qubits, while NISQ systems have nearly one hundred at most. One common approach to improve reliability is to adjust the compilation process to create a more reliable final circuit, where the two most critical compilation decisions are the qubit allocation and qubit routing problems. We focus on solving the qubit allocation problem and identifying initial layouts that result in a reduction of error. To identify these layouts, we combine reinforcement learning with a graph neural network (GNN)-based Q-network to process the mesh topology of the quantum computer, known as the backend, and make mapping decisions, creating a Graph Neural Network Assisted Quantum Compilation (GNAQC) strategy. We train the architecture using a set of four backends and six circuits and find that GNAQC improves output fidelity by roughly 12.7% over pre-existing allocation methods. 
    more » « less
    Free, publicly-accessible full text available June 5, 2024
  3. Serverless computing automates fine-grained resource scaling and simplifies the development and deployment of online services with stateless functions. However, it is still non-trivial for users to allocate appropriate resources due to various function types, dependencies, and input sizes. Misconfiguration of resource allocations leaves functions either under-provisioned or over-provisioned and leads to continuous low resource utilization. This paper presents Freyr, a new resource manager (RM) for serverless platforms that maximizes resource efficiency by dynamically harvesting idle resources from over-provisioned functions to under-provisioned functions. Freyr monitors each function’s resource utilization in real-time, detects over-provisioning and under-provisioning, and learns to harvest idle resources safely and accelerates functions efficiently by applying deep reinforcement learning algorithms along with a safeguard mechanism. We have implemented and deployed a Freyr prototype in a 13-node Apache OpenWhisk cluster. Experimental results show that 38.8% of function invocations have idle resources harvested by Freyr, and 39.2% of invocations are accelerated by the harvested resources. Freyr reduces the 99th-percentile function response latency by 32.1% compared to the baseline RMs. 
    more » « less
  4. The proliferation of IoT devices, with various capabilities in sensing, monitoring, and controlling, has prompted diverse emerging applications, highly relying on effective delivery of sensitive information gathered at edge devices to remote controllers for timely responses. To effectively deliver such information/status updates, this paper undertakes a holistic study of AoI in multi-hop networks by considering the relevant and realistic factors, aiming for optimizing information freshness by rapidly shipping sensitive updates captured at a source to its destination. In particular, we consider the multi-channel with OFDM (orthogonal frequency-division multiplexing) spectrum access in multi-hop networks and develop a rigorous mathematical model to optimize AoI at destination nodes. Real-world factors, including orthogonal channel access, wireless interference, and queuing model, are taken into account for the very first time to explore their impacts on the AoI. To this end, we propose two effective algorithms where the first one approximates the optimal solution as closely as we desire while the second one has polynomial time complexity, with a guaranteed performance gap to the optimal solution. The developed model and algorithms enable in-depth studies on AoI optimization problems in OFDM-based multi-hop wireless networks. Numerical results demonstrate that our solutions enjoy better AoI performance and that AoI is affected markedly by those realistic factors taken into our consideration. 
    more » « less
  5. With the growing effort to reduce power consumption in machines, fault tolerance becomes more of a concern. This holds particularly for large-scale computing, where execution failures due to soft faults waste excessive time and resources. These large-scale applications are normally parallel in nature and rely on control structures tailored specifically for parallel computing, such as locks and barriers. While there are many studies on resilient software, to our knowledge none of them focus on protecting these parallel control structures. In this work, we present a method of ensuring the correct operation of both locks and barriers in parallel applications. Our method tracks the memory locations used within parallel sections and detects a violation of the control structures. Upon detecting any violation, the violating thread is rolled back to the beginning of the structure and reattempts it, similar to rollback mechanisms in transactional memory systems. We test the method on representative samples of the BigDataBench kernels and find it exhibits a mean error reduction of 93.6% for basic mutex locks and barriers with a mean 6.55% execution time overhead at 64 threads. Additionally, we provide a comparison to transactional memory methods and demonstrate up to a mean 57.5% execution time overhead reduction. 
    more » « less